Reducing the number of data cache accesses improves performance, port efficiency, and
bandwidth, and motivates the use of single-ported caches instead of complex and
expensive multi-ported ones. In this paper we consider an intrusion detection system as a 
target application and study the effectiveness of two techniques - (i) prefetching data 
from the cache into local buffers in the processor core and (ii) load Instruction Reuse (IR) 
- in reducing data cache traffic. The analysis is carried out using ... 
The increasing use of microprocessor cores in embedded systems as well as mobile and
portable devices creates an opportunity for customizing the cache subsystem for 
improved performance. In traditional cache design, the index portion of the memory 
address bus consists of the K least significant bits, where K = log2(D) and D is the
depth of the cache. However, in devices where the application set is known and
Cache memories have become an essential part of modern processors to bridge the
increasing gap between fast processors and slower main memory. Until recently, cache
memories were thought to impose unpredictable execution time behavior for hard
real-time systems. But recent results show that the speedup of caches can be exploited
without a significant sacrifice of predictability. These results were obtained under the
assumption that real-time tasks be scheduled non-preemptively. This paper ...
Cache miss stalls hurt performance because of the large gap between memory and
processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 
45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. 
Traversing linked data structures causes a large portion of these stalls. Prefetching for 
linked data structures remains a major challenge because serial data dependencies 
We observe a non-negligible fraction (3 to 16% in our benchmarks) of dynamically dead
instructions, dynamic instruction instances that generate unused results. The majority of
these instructions arise from static instructions that also produce useful results. We find
that compiler optimization (specifically instruction scheduling) creates a significant portion
of these partially dead static instructions. We show that most of the dynamically dead
instructions arise from a small set of st ...
High-performance processors use a large set-associative L1 data cache with multiple
ports. As clock speeds and size increase such a cache consumes a significant percentage 
of the total processor energy. This paper proposes a method of saving energy by reducing 
the number of data cache accesses. It does so by modifying the Load/Store Queue design 
to allow "caching" of previously accessed data values on both loads and stores after the 
corresponding memory access instruction has been committed. It ... 
Highly aggressive multi-issue processor designs of the past few years and projections for
the decade, require that we redesign the operation of the cache memory system. The 
number of instructions that must be processed (including incorrectly predicted ones) will 
approach 16 or more per cycle. Since memory operations account for about a third of all 
instructions executed, these systems will have to support multiple data references per 
cycle. In this paper, we explore reference stream characterist ... 

Keywords: Locality-Based Interleaving, Multiporting, High-Bandwidth Data Supply,
Multi-Bank Caches
In the context of portable embedded systems, reducing energy is one of the prime
objectives. Most high-end embedded microprocessors include on-chip instruction and data
caches, along with a small energy efficient scratchpad. Previous approaches for utilizing
scratchpad did not consider caches and hence fail for the au courant architecture. In
the presented work, we use the scratchpad for storing instructions and propose a generic
Cache Aware Scratchpad Allocation (CASA) algorithm. We report an aver ... 

13 Access region locality for high-bandwidth processor memory system design 
Sangyeun Cho, Pen-Chung Yew, Gyungho Lee 
Publisher: IEEE Computer Society
This paper studies an interesting yet less explored behavior of memory access
instructions, called access region locality. Unlike the traditional temporal and spatial data
locality that focuses on individual memory locations and how accesses to the locations are
inter-related, access region locality concerns each static memory instruction and
its range of access locations at run time. We consider a program's data, heap, and stack
regions in this paper. Our experimental study ... 
May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd
annual international symposium on Computer architecture ISCA '95, volume
For many programs, especially integer codes, untolerated load instruction latencies
account for a significant portion of total execution time. In this paper, we present the
design and evaluation of a fast address generation mechanism capable of eliminating the
delays caused by effective address calculation for many loads and stores. Our approach
works by predicting early in the pipeline (part of) the effective address of a memory 
access and using this predicted address to speculatively access the ... 
Gokhan Memik, William H. Mangione-Smith
October 2002 Proceedings of the 2002 international conference on Compilers,
architecture, and synthesis for embedded systems CASES '02
We propose and evaluate a data filtering method to reduce the power consumption of
high-end processors with multiple execution cores. Although the proposed method can be 
applied to a wide variety of multi-processor systems including MPPs, SMPs and any type 
of single-chip multiprocessor, we concentrate on Network Processors. The proposed 
method uses an execution unit called Data Filtering Engine that processes data with low 
temporal locality before it is placed on the system bus. The execution co ... 

Keywords: chip multiprocessors, data locality, network processors, power reduction, 
remote procedure call 
We examine the ability of CMPs, due to their lower on-chip communication latencies, to
exploit data parallelism at inner-loop granularities similar to that commonly targeted by 
vector machines. Parallelizing code in this manner leads to a high frequency of barriers, 
and we explore the impact of different barrier mechanisms upon the efficiency of this 
approach. To further exploit the potential of CMPs for fine-grained data parallel tasks, we 
present barrier filters, a mechanism for fast barrier sy ... 
dynamic software optimization system
June 2007 Proceedings of the 21st annual international conference on
Software or hardware data cache prefetching is an efficient way to hide cache miss
latency. However, the effectiveness of the issued prefetches has to be monitored in order to
maximize their positive impact while minimizing their negative impact on performance. In
previously proposed dynamic frameworks, the monitoring scheme is either achieved using
processor performance counters or using specific hardware. In this work, we propose a 
prefetching strategy which does not use any specific hardware com ... 

Keywords: binary instrumentation, data cache prefetching, dynamic optimization 
Modern embedded microprocessors use low power on-chip memories called scratch-pad
memories to store frequently executed instructions and data. Unlike traditional caches, 
scratch-pad memories lack the complex tag checking and comparison logic, thereby 
proving to be efficient in area and power. In this work, we focus on exploiting scratch-pad 
memories for storing hot code segments within an application. Static placement 
techniques focus on placing the most frequently executed portions of programs ... 
A parallel simulator, PSIMUL, has been used to collect information on the memory access
patterns and synchronization overheads of several scientific applications. The parallel 
simulation method we use is very efficient and it allows us to simulate execution of an 
entire application program, amounting to hundreds of millions of instructions. We present 
our measurements on the memory access characteristics of these applications; 
particularly our observations on shared and private data, their ... 
March 2006 Proceedings of the International Symposium on Code Generation and

Most programs are repetitive, where similar behavior can be seen at different execution 
times. Algorithms have been proposed that automatically group similar portions of a 
program's execution into phases, where samples of execution in the same phase have 
homogeneous behavior and similar resource requirements. In this paper, we present an 
automated profiling approach to identify code locations whose executions correlate with 
phase changes. These "software phase markers" can be used to easily dete ... 
High-performance out-of-order processors spend a significant portion of their execution
time on the incorrect program path even though they employ aggressive branch 
prediction algorithms. Although memory references generated on the wrong path do not 
change the architectural state of the processor, they can affect the arrangement of data 
in the memory hierarchy. This paper examines the effects of wrong-path memory 
references on processor performance. It is shown that these references significantl ...

Keywords: cache pollution, data prefetching, execution-driven simulation, processor 
performance analysis, speculative execution, wrong path modeling, wrong-path memory 
references 
This study has been carried out in order to determine cost-effective configurations of
functional units for multiple-issue out-of-order superscalar processors. The trace-driven 
simulations were performed on the six integer and the fourteen floating-point programs 
from the SPEC 92 suite. We first evaluate the number of instructions allowed to be 
concurrently processed by the execution stages of the pipeline. We then apply some 
restrictions on the execution issue of different instruction classes i ... 
As pipeline width and depth grow to improve performance, memory arrays in
microprocessors are growing in entries and ports. Arrays will increase in physical size, 
which prolongs the access time due to wiring delay. In order to boost clock frequency, 
these memory arrays must take multiple cycles to complete an access. This delays the 
scheduling of dependent instructions and affects overall performance. This paper proposes 
a different circuit organization to enable fast and slow accesses solely de ... 
Advances in technology have allowed portable electronic devices to become smaller and
more complex, placing stringent power and performance requirements on the devices'
components. The M•CORE M3 architecture was developed specifically for these embedded
applications. To address the growing need for longer battery life and higher performance,
an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with
programmable features was added to the M3 core. These features allow the a ...
Modern processors can issue and execute multiple instructions per cycle, often performing
multiple memory operations simultaneously. To reduce stalls due to resource conflicts,
most processors employ multi-ported L1 caches and TLBs to enable concurrent memory
accesses. In this paper, we observe that data TLB lookups within a cycle and across 
consecutive cycles are often synonymous — they go to the same page. To exploit this 
finding, we propose two new mechanisms — intra-cycle compa ... 

Keywords: low-power TLB, multi-porting, spatial and temporal locality 
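
The same-page observation in the abstract above can be illustrated with a small counting sketch. This is not the paper's mechanism (its description is truncated here); it only measures how often a data TLB lookup targets a page already looked up in the same cycle or in the previous cycle. The 4 KB page size, the function names, and the per-cycle address lists are assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical, not from the abstract)


def page_number(addr, page_size=PAGE_SIZE):
    """Return the virtual page number that an address falls in."""
    return addr // page_size


def count_same_page_lookups(cycles, page_size=PAGE_SIZE):
    """Count redundant TLB lookups in a trace.

    cycles: list of lists; each inner list holds the virtual addresses
            whose translations are requested in one cycle.
    Returns (total, intra, inter) where
      total = number of lookups,
      intra = lookups whose page was already looked up earlier in the same cycle,
      inter = remaining lookups whose page was looked up in the previous cycle.
    """
    total = intra = inter = 0
    prev_pages = set()
    for addrs in cycles:
        seen = set()
        for addr in addrs:
            total += 1
            page = page_number(addr, page_size)
            if page in seen:
                intra += 1
            elif page in prev_pages:
                inter += 1
            seen.add(page)
        prev_pages = seen
    return total, intra, inter


if __name__ == "__main__":
    # Hypothetical three-cycle trace with two memory operations in the first two cycles.
    trace = [[0x1000, 0x1008], [0x1FF0, 0x2000], [0x5000]]
    print(count_same_page_lookups(trace))
```

A high intra or inter count on a real trace would suggest that many lookups could share one translation, which is the kind of redundancy the abstract says it exploits.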
Chip Multiprocessors (CMP) and Simultaneous Multi-threading (SMT) are two approaches
that have been proposed to increase processor efficiency. We believe these two
approaches are two extremes of a viable spectrum. Between these two extremes, there
exists a range of possible architectures, sharing varying degrees of hardware between
processors or threads. This paper proposes conjoined-core chip multiprocessing -
topologically feasible resource sharing between adjacent cores of a chip multiprocess ...
The growing class of portable systems, such as personal computing and communication
devices, has resulted in a new set of system design requirements, mainly characterized by 
dominant importance of power minimization and design reuse. We develop the design 
methodology for the low power core-based real-time system-on-chip based on 
dynamically variable voltage hardware. The key challenge is to develop effective 
scheduling techniques that treat voltage as a variable to be determined, in additio ... 
Scratchpad Memories (SPMs) are commonly used in embedded systems because they are
more energy-efficient than caches and enable tighter application control on the memory 
hierarchy. Optimally mapping code and data to SPMs is, however, still a challenge. This 
paper proposes an optimal scratchpad mapping approach for code segments, which has 
the distinctive characteristic of working directly on application binaries, thus requiring no 
access to either the compiler or the application source code - a c ... 

Keywords: design automation, dynamic programming, embedded design, executable 
patching, memory hierarchy, optimization algorithm, post-compiler processing, power 
saving, scratchpad memory 